-
- PROFESSIONAL OPTICAL CHARACTER RECOGNITION - PRO-CR<tm>
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
- Copyright 1989, David P. Gray, Gray Design Associates
- All Rights Reserved
-
- Member, Association of Shareware Professionals
-
-
-
-
- -----------------------------[ C O N T E N T S ]-----------------------------
-
-
- 1. Specification.
- 2. System Requirements.
- 3. Files Distributed.
- 4. Revision History.
- 5. Future Versions.
-
- 6. USER GUIDE
- 6.1 Start-Up Procedure.
- 6.2 Feeding Input to PRO-CR<tm>.
- 6.3 Font Selection.
- 6.4 Non HP ScanJet Users.
- 6.5 Output Text File.
- 6.6 Performance.
- 6.7 Theory of Operation.
- 6.8 Menus.
- 6.9 Learn Mode.
- 6.10 Edit Mode.
- 6.11 Error Messages.
-
- 7. Site Licenses.
- 8. Comments to the Author.
- 9. Association of Shareware Professionals.
- 10. Miscellaneous.
-
-
-
-
- ----------------------------[ 1. SPECIFICATION ]----------------------------
-
- * Reads 8 to 30 point mono and proportional fonts.
- * Up to 200 words per minute.
- * Supports HP ScanJet or any scanner that supports TIFF files
- (not suitable for hand-scanners).
- * Training and font editing supported with EGA or VGA adapter.
- * Real-time viewing of text during normal and training scan.
- * Continuous scanning if auto document feeder attached.
- * Upgrade to Version 2 when available.
-
-
-
-
- -------------------------[ 2. SYSTEM REQUIREMENTS ]-------------------------
-
- PRO-CR<tm> performs Optical Character Recognition on an IBM PC or compatible.
- The program will run on an XT or AT; however, an AT is strongly recommended
- because the program is highly CPU-intensive.
-
- A graphics adapter is not required for basic character recognition, but is
- needed for the training and font-edit functions. If a graphics adapter is
- used, it should be EGA or VGA. (CGA does not have the required resolution.)
-
- The minimum memory requirement is about 80Kb (512Kb is recommended), although
- the program adapts itself to use as much conventional memory as available.
- Version 1 does not support expanded or extended memory. A temporary disk file
- is used for any parts of the scanned image that will not fit into memory at
- once.
-
-
-
-
- --------------------------[ 3. FILES DISTRIBUTED ]--------------------------
-
- OCR.EXE : The PRO-CR<tm> program
- README.DOC : Important information
- HELP1.DOC : Text file used for online help
- HELP2.DOC : Text file used for online help
- HELP3.DOC : Text file used for online help
- MANUAL.DOC : This file
- COURIER.OCR : Font file
- ROMAN.OCR : Font file
- HELV.OCR : Font file
- IMAGE.TIF : Example TIFF file for processing
-
- NOTE: The text in the IMAGE.TIF file is in Courier.
-
-
-
-
- --------------------------[ 4. REVISION HISTORY ]---------------------------
-
- 1.0 05/16/89 : Baseline version.
- 1.01 05/18/89 : Fixed character editing in font edit
- function, caused by bug in compiler's
- loop optimizer.
- 1.02 05/31/89 : Don't reject TIFFs with no bits_per_sample
- tag. Assume a value of 1.
- 1.03 06/19/89 : Don't reject TIFFs with no resolution
- tags.
- 1.04 08/28/89 : Fixed bug in learn-mode.
- 1.05 11/29/89 : Fixed bug in Auto sheet feeder control.
-
-
-
-
- ----------------------------[ 5. FUTURE VERSIONS ]---------------------------
-
- Version 2.0 is currently in progress. Estimated shipping date is first quarter
- of 1990. The following is a list of features expected to be included:
-
- * Enhanced speed and recognition rate.
- * Font independence (just hit the start button).
- * Mouse support. Selection of areas to be scanned.
- * Mixed text and graphics blocks for desktop publishers.
- * Direct support of Logitech hand-scanner.
- * Ability to handle compressed TIFFs, PCX and MSP formats.
-
-
-
-
- -------------------------[ 6. U S E R   G U I D E ]-------------------------
-
-
-
-
- -------------------------[ 6.1 START-UP PROCEDURE ]-------------------------
-
- From the DOS prompt, type: ocr
-
-
-
-
- -----------------------[ 6.2 FEEDING INPUT TO PRO-CR ]----------------------
-
- There are 2 methods of supplying input to PRO-CR<tm>.
- 1. Direct scanning from an HP ScanJet.
- 2. Reading from a TIFF file produced by any other scanner.
-
- Direct scanning lets you scan a single page on the flat-bed scanner or, if an
- automatic document feeder is attached, scan multiple pages in succession.
- Version 1 always scans entire pages.
-
- PRO-CR<tm> recognizes both mono-spaced and proportionally spaced fonts. It
- adjusts automatically to character size and will automatically switch fonts
- when more than one is selected.
-
- PRO-CR<tm> is trainable. A learning mode is provided to learn unrecognized
- shapes or new fonts.
-
-
-
-
- ---------------------------[ 6.3 FONT SELECTION ]---------------------------
-
- PRO-CR<tm> provides a number of standard fonts for selection. More than one
- font may be selected in cases where you are not sure what font is on the page
- to be processed, or if there is more than one font on the page. For cases
- where only one font appears on the page to be scanned, selecting this font
- will generally give more accurate results and faster times than selecting all
- the fonts. However, the penalty for selecting all fonts is not great and is
- probably the best thing to do if you are in any doubt.
-
- If you are not sure what a particular font looks like, use the font editing
- feature to see the shapes of the default supplied fonts. (See the chapter on
- the edit mode).
-
-
-
-
- ------------------------[ 6.4 NON HP SCANJET USERS ]------------------------
-
- Compatibility with non-HP scanners is made possible through the use of TIFF
- (tag image file format) files. Many scanners and desktop publishing programs
- use this standard file format. A resolution of 300 dots per inch gives a
- good compromise between accuracy and processing time. If the text you are
- scanning is large, over 12 points, you may wish to scan at a lower resolution,
- say 240 or 200 dpi to speed processing in PRO-CR<tm>. In general, though, the
- higher the resolution the better the accuracy.
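-
- The trade-off between point size and resolution can be checked with a little
- arithmetic. The sketch below is not part of PRO-CR<tm>; it assumes the usual
- convention of 72 points to the inch and uses a hypothetical helper,
- glyph_height_pixels, to estimate the nominal height of a line of type in
- pixels.
-
```python
POINTS_PER_INCH = 72  # assumption: conventional 72 points per inch


def glyph_height_pixels(point_size, dpi):
    """Nominal body height, in pixels, of type at point_size scanned at dpi."""
    return point_size * dpi / POINTS_PER_INCH


# Compare small type at 300 dpi with larger type at a lower resolution.
for point_size, dpi in [(8, 300), (12, 300), (12, 200), (18, 200)]:
    height = glyph_height_pixels(point_size, dpi)
    print(f"{point_size:2d} pt at {dpi} dpi: about {height:.0f} pixels high")
```
-
- Under this assumption, 12 point type scanned at 200 dpi covers the same number
- of pixels as 8 point type scanned at 300 dpi, which is why larger text can
- tolerate a lower resolution.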
-
- When reading from a TIFF file, PRO-CR<tm> looks for the file IMAGE.TIF.
-
- Even if you do have an HP ScanJet, you can still use TIFF file input when you
- do not wish to scan the whole page. Use the scanning program that came with the
- scanner to scan the part of the page containing the text you wish to process.
-
- Version 1 of PRO-CR<tm> does not read compressed TIFF files.
-
-
-
-
- --------------------------[ 6.5 OUTPUT TEXT FILE ]--------------------------
-
- Whether scanning direct or reading from the TIFF file, all processed output
- is directed to a plain ASCII text file, default name TEXT.SAV. Version 1
- does not support word processor attributes or file formats. You may change
- the name of the output file in the "run" menu. Text is always appended
- to the file for each page scanned until you choose a new file name.
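-
- The append behaviour described above can be pictured with the following
- sketch. It is an illustration only, not PRO-CR<tm> code: the function name
- append_page is hypothetical, and a temporary directory stands in for the
- working directory.
-
```python
import os
import tempfile


def append_page(text, path):
    """Add one page of recognized text to the end of the output file."""
    # Mode "a" creates the file if needed and never truncates existing text.
    with open(path, "a", encoding="ascii") as output:
        output.write(text)


output_path = os.path.join(tempfile.mkdtemp(), "TEXT.SAV")
append_page("text from page one\n", output_path)
append_page("text from page two\n", output_path)

with open(output_path, encoding="ascii") as saved:
    print(saved.read(), end="")
```
-
- Because nothing is overwritten, text from an earlier session stays in the file
- until a new file name is chosen.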
-
-
-
-
- -----------------------------[ 6.6 PERFORMANCE ]----------------------------
-
-
- 6.6.1 Font Size
- ~~~~~~~~~~~~~~~~~
- PRO-CR<tm> automatically adjusts itself to a range of point sizes within any
- document. The range is approximately 8 to 30 points. The low end depends
- on the quality of the document and the typeface used. These figures assume
- the image was scanned at 300 dots per inch (the resolution used for the
- direct scanning mode).
-
- Learning mode allows a total of 12 fonts, with up to 90 shapes in each font
- and up to 3 fonts selectable simultaneously in learning or non-learning mode.
-
-
- 6.6.2 Processing Speed
- ~~~~~~~~~~~~~~~~~~~~~~~~
- With one font selected, PRO-CR<tm> will process text at approximately 200
- words per minute on a 20MHz 386 PC.
-
-
- 6.6.3 Error Rate
- ~~~~~~~~~~~~~~~~~~
- The error rate is dependent on the quality of the text being processed and
- on the number of characters that "run together". In general the mono-spaced
- fonts such as Courier are easiest and the Roman font is the hardest to
- accurately recognize. For cases where characters run together, the learning
- mode can be used to help recognition.
-
- With good quality type the recognition accuracy is approximately 95% to 99%
- for Courier and 90% to 95% for Roman and Helvetica.
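-
- Taking the percentages above as the share of characters recognized correctly,
- the number of corrections to expect on a page is easy to estimate. The sketch
- below assumes a page of 2000 characters, a figure chosen only for
- illustration; expected_errors is a hypothetical helper.
-
```python
def expected_errors(char_count, accuracy_percent):
    """Estimate how many characters will need correcting by hand."""
    return char_count * (100 - accuracy_percent) / 100


PAGE_CHARS = 2000  # assumption: characters on a typical typed page

for accuracy in (99, 95, 90):
    errors = expected_errors(PAGE_CHARS, accuracy)
    print(f"{accuracy}% accuracy: about {errors:.0f} errors per page")
```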
-
-
-
-
- -------------------------[ 6.7 THEORY OF OPERATION ]------------------------
-
-
- 6.7.1 PRO-CR<tm> Uses Feature Extraction
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- PRO-CR<tm> uses feature extraction with one proprietary global (topological)
- feature and two local features. The local features are optimized for the
- three supplied fonts. With all three fonts selected, this combination of
- features also gives good recognition on other non-stylized fonts. PRO-CR<tm>
- also includes a large number of ad-hoc positional and context-sensitive tests.
-
-
- 6.7.2 Single Character Errors
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- PRO-CR<tm> will not correctly recognize single characters 100% of the time.
- For every character guessed wrong, the reason is usually to be found on the
- document: broken characters, skewed lines, misplaced text and smudges, to name
- a few. Sometimes it is just bad luck. (For the technically minded, every
- signal processing system involves some noise. In this case the noise is in
- the scanning conversion and is a function of the scan resolution. Some
- characters look very much alike (S vs 5, b vs h), and one pixel dropped from
- one place and appearing in another can cause mis-recognition.)
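-
- How close two characters can be is easy to see by counting the pixels in
- which their bitmaps differ. The sketch below is not PRO-CR<tm>'s feature
- extraction; the 5 by 7 bitmaps B and H and the helper pixel_difference are
- made up purely to illustrate the point.
-
```python
# Hypothetical 5x7 bitmaps of lower-case "b" and "h" ('#' = ink, '.' = paper).
B = [
    "#....",
    "#....",
    "####.",
    "#...#",
    "#...#",
    "#...#",
    "####.",
]
H = [
    "#....",
    "#....",
    "####.",
    "#...#",
    "#...#",
    "#...#",
    "#...#",
]


def pixel_difference(first, second):
    """Count the positions at which two equal-sized bitmaps disagree."""
    return sum(
        a != b
        for row_a, row_b in zip(first, second)
        for a, b in zip(row_a, row_b)
    )


print(pixel_difference(B, H), "of", len(B) * len(B[0]), "pixels differ")
```
-
- In these toy bitmaps only the bottom row separates the two letters, so a small
- amount of scanner noise in that row is enough to blur the distinction.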
-
-
- 6.7.3 Run-Ons
- ~~~~~~~~~~~~~~~
- One of the biggest problems faced by Optical Character Recognizers is
- run-ons. The ultimate run-on is human handwriting, in which all the characters
- are joined together. This kind of recognition is beyond the scope of most
- PC-based OCRs, including this one. PRO-CR<tm> handles run-ons only to the
- limited extent described below.
-
- PRO-CR<tm> recognizes mono-spaced fonts such as Courier and proportional fonts
- such as Helvetica and Roman. In tightly spaced proportional fonts many of the
- characters run into each other. (This can also happen in badly spaced
- mono-spaced fonts.) The software only recognizes single objects and so gets
- very confused by characters joined together in this way. It tries to split
- such run-ons so that they can be recognized as two characters, but will often
- still fail. Where three characters are joined it is almost certain to fail.
- Run-ons depend heavily on the particular printer that printed the page, and
- for this reason a learning mode is provided. This allows for learning unique
- shapes applicable to a particular document. It also provides a mechanism to
- learn a completely new font. Bear in mind that when learning a new font, best
- results are obtained with good clean type, 10 points or larger. Don't try
- working with any kind of script font where all the characters flow together;
- you won't get very far!
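-
- One common way to split a run-on, shown here only as an illustration, is to
- cut the joined shape at the column containing the least ink. The manual does
- not say how PRO-CR<tm> performs its split, so the function
- split_at_thinnest_column and the sample bitmap below are assumptions.
-
```python
def split_at_thinnest_column(bitmap):
    """Cut a joined shape at the interior column with the fewest ink pixels."""
    width = len(bitmap[0])
    # Vertical projection: how many '#' pixels each column contains.
    ink = [sum(row[x] == "#" for row in bitmap) for x in range(width)]
    # Only interior columns are candidates, so neither half comes out empty.
    cut = min(range(1, width - 1), key=lambda x: ink[x])
    left = [row[:cut] for row in bitmap]
    right = [row[cut + 1:] for row in bitmap]
    return cut, left, right


# Two identical made-up shapes joined by a one-pixel bridge in the middle row.
JOINED = [
    "###.###",
    "#.#.#.#",
    "#######",
    "#.#.#.#",
    "###.###",
]

cut, left, right = split_at_thinnest_column(JOINED)
print("cut at column", cut)
print("\n".join(left))
```
-
- A single cut like this can only ever separate two characters, which matches
- the difficulty described above when three characters are joined.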
-
-
-
-
- --------------------------------[ 6.8 MENUS ]-------------------------------
-
- The operator interface is implemented as a series of menu levels. Completing
- one menu takes you to the next; selecting QUIT takes you to the previous one
- (or back to DOS if you are at the first menu). The following is a list of
- menus and selections.
-
- MENU 1. (Select mode:)
- Select "Scan_mode" if you will be scanning direct or
- Select "File_mode" if you will be reading from your image.tif file.
-
- MENU 2. (Select font(s):)
- Select one or more fonts to be used when performing the OCR. The selected fonts
- are indicated by a check mark. Select the OK option to get to the run menu,
- menu 3.
-
- Menus 2a and 2b are for use when learning or editing a font. You can skip these
- menus during normal use and select OK to proceed to menu 3.
-
- MENU 3. (run)
- Select FILE NAME if you wish to change the name of the file to which the
- processed text will be written. The default file is "text.sav". This is a
- plain ASCII text file which may be imported into any word processor or desktop
- publishing program. If more than one scan is done, new information is appended
- to this file until you select a new file name.
-
- Select START for a single page scan or
-
- Select AUTO FEED for a multi-page scan. This is only available when scanning
- directly, and an automatic document feeder must be present and ready for use.
- File mode always processes everything in the file.
-
-
-
-
- -----------------------------[ 6.9 LEARN MODE ]-----------------------------
-
- MENU 2a. (Select font for learning:)
- Select the Learn option to select a font for learning or to add a brand new
- font for learning. Only one font can be learnt at a time; it is indicated by
- an "L" instead of a check mark. During the OCR you will be prompted to enter up
- to 3 characters for each unrecognized shape. If you are not sure what the
- unrecognized text is, press return. It will be ignored.
-
- A point to note about learning mode is that none of the 3 supplied fonts
- includes run-ons, in other words, combinations of characters which are joined
- together. The reason for this is that the shapes of joined characters are
- largely printer dependent, so what works well for one document may not work
- for another. In addition, the more shapes that are added to the font library,
- the more chance there is of choosing the wrong shape.
-
- There are 2 uses for learning (training) mode:
- 1. When there is a large amount of scanning to be done, for example a book,
- and it is worthwhile creating a special font just for this one document. Do not
- try to learn to the 3 fonts supplied; they are write-protected. Instead, add a
- new font and learn to this.
-
- 2. Another use for the learning mode is to learn a new font from scratch.
- In this case best results will be obtained if you supply the font in the form
- of an alphabet, with characters spaced well apart and in a large point size,
- say 14 or more. If necessary you can learn a completely new font just from the
- final copy to be processed, but this will not give the best results. The
- program will prompt less and less as it proceeds to learn the alphabet. It will
- often prompt for a character more than once. This is an indication of the
- variability of the characters scanned.
-
- Hints for learning:
- The learning mode uses any characters you give it to try to match new
- characters. In this way it should prompt you less and less as it learns the
- complete alphabet, eventually prompting only for joined or broken characters.
- However, you will find that on occasion it will prompt you for characters you
- have already entered. This is due to the fact that there is a recognition
- threshold set which is a compromise between recognizing a character that has
- not been learnt yet and prompting too often for characters already learnt.
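-
- The effect of the recognition threshold can be sketched as follows. This is
- an assumed, much simplified model and not PRO-CR<tm>'s matching code: shapes
- are flattened 3 by 3 bitmaps, the distance is a plain count of differing
- pixels, and the names classify, LEARNT and NOISY_O are hypothetical.
-
```python
def classify(shape, learnt, threshold):
    """Return the label of the closest learnt shape, or None to prompt."""
    best_label, best_distance = None, None
    for label, template in learnt:
        distance = sum(a != b for a, b in zip(shape, template))
        if best_distance is None or distance < best_distance:
            best_label, best_distance = label, distance
    # Accept the match only if it is close enough; otherwise ask the operator.
    if best_distance is not None and best_distance <= threshold:
        return best_label
    return None


# Made-up 3x3 shapes written row by row ('#' = ink, '.' = paper).
LEARNT = [("o", "####.####"), ("c", "####..###")]
NOISY_O = "####.###."  # an "o" with one pixel dropped by the scanner

print(classify(NOISY_O, LEARNT, threshold=1))  # loose threshold: matched
print(classify(NOISY_O, LEARNT, threshold=0))  # strict threshold: prompts
```
-
- A strict threshold prompts again for a character that has already been learnt,
- while a loose one risks accepting a shape that has not been learnt at all.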
-
-
-
-
- ------------------------------[ 6.10 EDIT MODE ]-----------------------------
-
- Use the edit mode to consolidate your learnt font, removing unwanted duplicate
- characters or runs and correcting any mistakes made when entering the string
- representation for the character shape. During a learning session, do not try
- to enter a string for shapes that you do not recognize yourself; just hit
- return to skip to the next character. In particular, do not enter punctuation
- marks; these are handled with special algorithms. Some characters, such as o,
- u, v and x, are ambiguous with regard to case when viewed out of context. If
- you are unsure as to the case of a shape that the program prompts you with,
- either skip the character by pressing return or simply enter the lower case
- version. The program has special algorithms for adjusting the case of such
- ambiguous characters.
-
- After a learning session, always run the program in non-learn mode using the new
- font to determine the results. You may use one font, for example one of the
- supplied fonts, while learning to a new font.
-
- MENU 2b. (Select font to edit:)
- Select the EDIT option to select a font for editing. Editing consists of
- deleting unwanted shapes from the font or changing the text which they
- represent. Follow the directions on the edit screen. Note that the default
- fonts supplied with PRO-CR<tm> are write-protected, so any attempt to learn to
- them or edit them will fail.
-
-
-
-
- ---------------------------[ 6.11 ERROR MESSAGES ]--------------------------
-
- The following error codes relate to problems with the TIFF input file.
-
- 1 : Could not find the image.tif input file.
- 2 : Non-Intel byte order. The TIFF file is possibly a Mac file.
- 3 : Wrong value for bits_per_sample tag.
- 4 : Compressed TIFF file. This version does not handle compressed files.
- 5 : Wrong value for photometric_interpretation tag.
- 6 : Wrong value for fill_order tag.
- 7 : Wrong picture orientation.
- 8 : Wrong value for samples_per_pixel tag.
- 9 : Wrong value for minimum_sample tag.
- 10 : Wrong value for maximum_sample tag.
- 11 : Wrong value for planar_configuration tag.
- 12 : Missing bits_per_sample tag.
- 13 : Missing image_width tag.
- 14 : Missing image_length tag.
- 15 : Missing image_pointer tag.
- 16 : Missing X_resolution tag.
- 17 : Missing Y_resolution tag.
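-
- To see whether a TIFF file is likely to trigger one of these codes before
- running PRO-CR<tm>, a few of the checks can be approximated with a short
- script. The sketch below covers codes 2, 4, 13 and 14 only. The header layout
- and tag numbers come from the TIFF specification; the order in which
- PRO-CR<tm> itself makes its checks is not documented, so the order here is an
- assumption, and check_tiff and tiny_tiff are hypothetical names.
-
```python
import struct

# Tag numbers from the TIFF specification.
IMAGE_WIDTH, IMAGE_LENGTH, COMPRESSION = 256, 257, 259


def check_tiff(data):
    """Return a manual-style error code for TIFF bytes, or 0 if none applies."""
    if data[:2] != b"II":  # "II" = Intel byte order, "MM" = Motorola
        return 2
    magic, ifd_offset = struct.unpack_from("<HI", data, 2)
    if magic != 42:
        raise ValueError("not a TIFF file")
    (entry_count,) = struct.unpack_from("<H", data, ifd_offset)
    tags = {}
    for i in range(entry_count):
        # Each 12-byte entry holds: tag, field type, count, value or offset.
        tag, field_type, count, value = struct.unpack_from(
            "<HHII", data, ifd_offset + 2 + 12 * i
        )
        if field_type == 3 and count == 1:  # a single SHORT uses the low 2 bytes
            value &= 0xFFFF
        tags[tag] = value
    if tags.get(COMPRESSION, 1) != 1:  # 1 means uncompressed (the default)
        return 4
    if IMAGE_WIDTH not in tags:
        return 13
    if IMAGE_LENGTH not in tags:
        return 14
    return 0


def tiny_tiff(entries):
    """Build an Intel-order header and one directory of SHORT tags (no pixels)."""
    data = b"II" + struct.pack("<HI", 42, 8) + struct.pack("<H", len(entries))
    for tag, value in entries:
        data += struct.pack("<HHII", tag, 3, 1, value)
    return data + struct.pack("<I", 0)  # offset 0 = no further directories


print(check_tiff(tiny_tiff([(IMAGE_WIDTH, 100), (IMAGE_LENGTH, 50)])))
print(check_tiff(tiny_tiff([(IMAGE_WIDTH, 100), (IMAGE_LENGTH, 50),
                            (COMPRESSION, 5)])))
```
-
- To check a real file, pass its contents to check_tiff, for example
- check_tiff(open("IMAGE.TIF", "rb").read()).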
-
-
-
-
- -----------------------------[ 7. SITE LICENSE ]----------------------------
-
- COMPANIES please note that only ONE USER at ONE LOCATION may use and operate
- PRO-CR<tm>.
-
- Additional computers, users and locations should be registered separately,
- by volume, or by obtaining a site license.
-
- DISCOUNT RATES are offered to companies registering for a site license or by
- volume. Please write to Gray Design Associates, P.O. Box 333, Northboro,
- MA 01532, USA for a rate schedule.
-
-
-
-
- ------------------------[ 8. COMMENTS TO THE AUTHOR ]-----------------------
-
- Any feedback would be greatly appreciated. Please direct any comments to the
- author personally by mail: David P. Gray, Gray Design Associates,
- P.O. Box 333, Northboro, MA 01532, USA.
-
-
-
-
- ----------------[ 9. ASSOCIATION OF SHAREWARE PROFESSIONALS ]---------------
-
- This software is produced by David P. Gray who is a member of the Association
- of Shareware Professionals (ASP). ASP wants to make sure that the shareware
- principle works for you. If you are unable to resolve a shareware-related
- problem with an ASP member by contacting the member directly, ASP may be able
- to help.
-
- The ASP Ombudsman can help you resolve a dispute or problem with an ASP member,
- but does not provide technical support for members' products. Please write to
- the ASP Ombudsman at P.O. Box 5786, Bellevue, WA 98006, USA or send a CompuServe
- message via easyplex to ASP Ombudsman 70007,3536.
-
-
-
-
- ----------------------------[ 10. MISCELLANEOUS ]---------------------------
-
- HP and ScanJet are registered trademarks of Hewlett-Packard.
-
-
-
-
- ----------------------------[ END OF MANUAL.DOC ]----------------------------
-